Abstract: The problem of constructing confidence sets in the high-dimensional
linear model with $n$ response variables and $p$ parameters, possibly $p > n$,
is considered. Necessary and sufficient conditions for the existence of
confidence sets that adapt to the unknown sparsity of the parameter vector
are given in terms of $\ell^2$-separation conditions. These are derived from a
minimax analysis of closely related composite testing problems. The design
conditions cover common coherence assumptions used in models for sparse
inference, such as Gaussian and sub-Gaussian designs. The results imply in
particular that sparse confidence sets exist only over strict subsets of the
parameter spaces for which sparse estimators exist. Qualitative differences
between the highly and moderately sparse case are shown to exist, and the
case of $p < n$ is analysed separately, where a transition to the theory of
adaptive confidence sets in standard nonparametric and parametric models
is exhibited. Concrete inferential procedures that can be used over maximal
parameter spaces are discussed.

Dedicated to the memory of Yuri I. Ingster

1. Introduction 



Consider the linear model

$$Y = X\theta + \varepsilon \tag{1}$$

where $X$ is an $n \times p$ matrix, $\theta \in \mathbb{R}^p$, potentially $p > n$, and where
$\varepsilon$ is an $n \times 1$ vector consisting of i.i.d. Gaussian noise with mean zero and
known variance standardised to one. To develop the main ideas, let us assume
for the moment that the design is random, and that the matrix $X$ consists of
i.i.d. $N(0, 1)$ Gaussian entries $(X_{ij})$, all independent of $\varepsilon$, reflecting a
prototypical high-dimensional model, such as those encountered in compressive
sensing; our main results hold for more general design assumptions that we
introduce and discuss in detail below. We denote by $P_\theta$ the law of $(Y, X)$, by
$E_\theta$ the corresponding expectation, and will omit the subscript $\theta$ when no
confusion may arise. For the asymptotic analysis we shall let $\min(n, p)$ tend
towards infinity, and the $o$, $O$-notation is to be understood accordingly.

We denote by $B_0(k)$ an $\ell^0$-'ball' of radius $k$ in $\mathbb{R}^p$, so all vectors in $\mathbb{R}^p$
with at most $k < p$ nonzero entries. As common in the literature on
high-dimensional models, we shall consider $p$ potentially greater than $n$ but
signals $\theta$ that are sparse in the sense that $\theta \in B_0(k)$ for some $k$
significantly smaller than $p$, typically $k < n$, so that consistent estimation
of $\theta$ is still possible. We set

$$k = k(\beta) \sim p^{1-\beta}, \quad 0 < \beta \le 1.$$

The parameter $\beta$ measures the sparsity of the signal: if $\beta$ is close to one only
very few of the $p$ coefficients of $\theta$ are nonzero. If $\beta \in (0, 1/2]$ one speaks of
the moderately sparse case and for $\beta \in (1/2, 1]$ of the highly sparse case. We
include the case $\beta = 1$ where, by convention, $k = \mathrm{const} \times p^0 = \mathrm{const}$.
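The parametrisation of sparsity by $\beta$ can be made concrete with a minimal sketch. The function names, the rounding by `ceil` and the constant `const` are assumptions made for illustration only; the relation $k \sim p^{1-\beta}$ is asymptotic and fixes $k$ only up to constants.

```python
import math


def sparsity_level(p: int, beta: float, const: float = 1.0) -> int:
    """Return ceil(const * p**(1 - beta)), a stand-in for k ~ p^(1 - beta).

    Assumption: the asymptotic relation is replaced by an explicit constant
    and rounding; beta = 1 gives k = const, as in the convention of the text.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    return max(1, math.ceil(const * p ** (1.0 - beta)))


def regime(beta: float) -> str:
    """Classify beta: (0, 1/2] is moderately sparse, (1/2, 1] is highly sparse."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    return "moderately sparse" if beta <= 0.5 else "highly sparse"
```

For example, `sparsity_level(10_000, 0.75)` gives a sparsity level of roughly ten coefficients out of ten thousand, and `regime(0.75)` classifies this as highly sparse.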

A sparse adaptive estimator $\hat\theta = \hat\theta_{np} = \hat\theta(Y, X)$ for $\theta$ achieves for every $n$,
every $k < p$, some universal constant $c$ and with high $P_\theta$-probability, the
risk bound

$$\|\hat\theta - \theta\|^2 \le c \log p \times \frac{k}{n}, \tag{2}$$

uniformly for all $\theta \in B_0(k)$. Here $\|\cdot\| = \|\cdot\|_2$ denotes the standard Euclidean
norm on $\mathbb{R}^p$, with inner product $\langle\cdot,\cdot\rangle$. Such estimators exist (see Corollary 2
below for example) - they attain the risk of an estimator that would know the
positions of the $k$ nonzero coefficients, with the mild penalty of $\log p$. The
literature on such estimators is abundant, see, for instance, Candès and Tao
[2007], Bickel et al. [2009], and the monograph Bühlmann and van de Geer [2011],
where many further references can be found.

We are interested in the question of whether one can construct a confidence set
for $\theta$ that takes inferential advantage of sparsity as in (2). Most of what
follows applies to the related problem of constructing confidence sets for $X\theta$
as well; we discuss this briefly at the end of the introduction. A confidence
set $C = C_{np}$ is a random subset of $\mathbb{R}^p$ - depending only on the sample $Y, X$ and
on a significance level $0 < \alpha < 1$ - that we require to contain the true
parameter $\theta$ with at least a prescribed probability $1 - \alpha$. We shall consider a
minimal degree of sparsity

$$k_1 \sim p^{1-\beta_1},$$

where we may choose $\beta_1 \in (0, 1)$ as we wish - our statistical procedure should
have coverage over signals that are at least $\beta_1$-sparse, and decreasing $\beta_1$
makes this requirement more difficult. Given $\alpha$, any level $\alpha$-confidence set $C$
should then be asymptotically honest over $B_0(k_1)$, that is, it should satisfy

$$\liminf_{\min(n,p)\to\infty} \inf_{\theta \in B_0(k_1)} P_\theta(\theta \in C) \ge 1 - \alpha. \tag{3}$$

Moreover, if we measure the diameter of $C$ in a natural way by the loss function
from (2) we should require that, if $|C|_2$ is the random $\|\cdot\|$-radius of the
smallest Euclidean ball that contains $C$, then for every $\alpha' > 0$ there exists a
universal constant $L$ such that for every $0 < k \le k_1$,

$$\limsup_{\min(n,p)\to\infty} \sup_{\theta \in B_0(k)} P_\theta\left(|C|_2^2 > L \log p \times \frac{k}{n}\right) \le \alpha'. \tag{4}$$

Such a confidence set would cover the true $\theta$ with prescribed probability, and
would shrink at an optimal rate for $k$-sparse signals without requiring
knowledge of the position of the $k$ nonzero coefficients. Our analysis below
will imply that a confidence set that simultaneously satisfies (3) and (4) does
not exist. This is so despite the existence of estimators satisfying (2); the
construction of sparse confidence sets is thus a qualitatively different
problem from that of sparse estimation.

In our analysis we shall follow the separation approach to adaptive confidence
sets introduced in Giné and Nickl [2010], Hoffmann and Nickl [2011], Bull and
Nickl [2012] in the framework of nonparametric function estimation. We shall
attempt to make honest inference over maximal subsets of $B_0(k_1)$ where $k_1$ is
given a priori as above, in a way that is adaptive over the submodel of sparse
vectors $\theta$ that belong to $B_0(k_0)$,

$$k_0 \sim p^{1-\beta_0}, \quad k_0 < k_1, \quad \beta_0 > \beta_1.$$

We shall remove those $\theta \in B_0(k_1)$ that are too close in Euclidean distance to
$B_0(k_0)$, and consider

$$B_0(k_1, \rho) = \{\theta \in B_0(k_1) : \|\theta - B_0(k_0)\| \ge \rho\} \tag{5}$$

where $\rho = \rho_{np}$ is a separation sequence, and where $\|\theta - Z\| = \inf_{z \in Z} \|\theta - z\|$
for any $Z \subset \mathbb{R}^p$. Thus, if $\theta \notin B_0(k_0)$, we remove the $k_0$ coefficients $\theta_j$ with
largest modulus $|\theta_j|$ from $\theta$, and require a lower bound on the $\ell^2$-norm of the
remaining subvector. In other words, if $|\theta_{(1)}| \le \dots \le |\theta_{(j)}| \le \dots \le |\theta_{(p)}|$ are
any order statistics of $\{|\theta_j| : j = 1, \dots, p\}$, then

$$\|\theta - B_0(k_0)\|^2 = \sum_{j=1}^{p-k_0} \theta_{(j)}^2$$

needs to exceed $\rho^2$. Defining

$$\Theta(\rho) = B_0(k_0) \cup B_0(k_1, \rho)$$

as our new model, we now require, instead of (3) and (4), the weaker coverage
property

$$\liminf_{\min(n,p)\to\infty} \inf_{\theta \in \Theta(\rho_{np})} P_\theta(\theta \in C_{np}) \ge 1 - \alpha, \tag{6}$$

as well as, for some finite constant $L > 0$,

$$\limsup_{\min(n,p)\to\infty} \sup_{\theta \in B_0(k_0)} P_\theta\left(|C_{np}|_2^2 > L \log p \times \frac{k_0}{n}\right) \le \alpha', \tag{7}$$

and

$$\limsup_{\min(n,p)\to\infty} \sup_{\theta \in B_0(k_1, \rho_{np})} P_\theta\left(|C_{np}|_2^2 > L \log p \times \frac{k_1}{n}\right) \le \alpha', \tag{8}$$

and search for minimal assumptions on the separation sequence $\rho_{np}$. Note that
any confidence set $C$ that satisfies (3) and (4) also satisfies the above three
conditions for any $\rho \ge 0$, so if one can prove the necessity of a lower bound
on the sequence $\rho_{np}$ then one disproves in particular the existence of adaptive
confidence sets in the stronger sense of (3) and (4).
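The distance to the sparse submodel that enters the separation condition (5) is simple to compute: one discards the $k_0$ coefficients of largest modulus and takes the Euclidean norm of what remains. The following is a minimal sketch of this computation and of membership in the separated model; the function names are hypothetical and vectors are plain Python lists.

```python
import math


def dist_to_sparse(theta, k0):
    """Euclidean distance from theta to the vectors with at most k0 nonzero entries.

    The k0 coefficients of largest modulus are removed and the l2-norm of the
    remaining subvector is returned.
    """
    squares = sorted(t * t for t in theta)        # ascending squared moduli
    kept = squares[: max(len(theta) - k0, 0)]     # drop the k0 largest
    return math.sqrt(sum(kept))


def in_separated_model(theta, k0, k1, rho):
    """True if theta is k0-sparse, or k1-sparse and at distance >= rho from B_0(k0)."""
    nonzero = sum(1 for t in theta if t != 0.0)
    if nonzero <= k0:
        return True
    return nonzero <= k1 and dist_to_sparse(theta, k0) >= rho
```

With `theta = [3.0, 0.0, -4.0, 1.0, 2.0]` and `k0 = 2`, the two largest coefficients 3 and -4 are removed and the distance is the norm of the remaining entries 0, 1 and 2.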

The following first result describes our findings in one relevant special case:
it gives necessary and sufficient conditions on the asymptotic order of $\rho$ for
the scenario where $p > n$, a confidence set is asked for that adapts to highly
sparse signals of a fixed finite dimension ($\beta_0 = 1$), and with coverage required
over moderately sparse alternatives ($\beta_1 < 1/2$) for which consistent estimation
is still possible ($k_1 = o(n/\log p)$). Other sets of assumptions will be analysed
below.

Theorem 1. Consider the model (1) with i.i.d. Gaussian design $X_{ij} \sim N(0, 1)$
and let $p > n$. Let $k_0$ be any fixed positive integer, let $0 < \beta_1 < 1/2$, and
assume

$$k_1 \sim p^{1-\beta_1} = o\left(\frac{n}{\log p}\right).$$

An honest adaptive confidence set $C_{np}$ over $\Theta(\rho_{np})$ in the sense of (6), (7), (8)
exists if and only if $\rho_{np}$ exceeds, up to a multiplicative universal constant,
$n^{-1/4}$, which is the minimax rate of testing between the composite hypotheses

$$H_0 : \theta \in B_0(k_0) \quad \text{vs.} \quad H_1 : \theta \in B_0(k_1, \rho_{np}). \tag{9}$$



The proof of the necessity part of Theorem 1 follows from Theorem 2 below, 
whereas the sufficiency part follows from Theorem 3. 

Theorem 1 and our further findings below imply that sparse confidence sets
exist precisely over those parameter subspaces of $B_0(k_1)$ for which the degree
of sparsity is asymptotically detectable. As our proofs show, sparse adaptive
confidence sets solve the above composite testing problem in a minimax way,
either implicitly or explicitly. The ideas of the paper Ingster et al. [2010], where
the testing problem (9) is considered with simple $H_0 : \theta = 0$, are instrumental
for our results, and we also refer to Arias-Castro et al. [2011] for related work.
Theorem 1 parallels the findings in nonparametric function estimation from
Hoffmann and Nickl [2011] and Bull and Nickl [2012], and re-iterates the general
observation that adaptive confidence sets exist over parameter spaces for which
the structural property one wishes to adapt to - in the present case sparsity -
can be detected from the sample.

Before we give our more detailed results, we wish to discuss their main
consequences in some detail. A first important aspect is that for $p > n$ one
always has to separate $B_0(k_0)$ and $B_0(k_1) \setminus B_0(k_0)$ in order to construct sparse
adaptive confidence sets. Since we are after $\ell^2$-type confidence sets, this may
come as a surprise: when constructing adaptive $L^2$-type confidence balls for
nonparametric functions certain specific situations exist where adaptation is
possible over the full parameter space, see Li [1989], Beran and Dümbgen [1998],
Hoffmann and Lepski [2002], Baraud [2004], Cai and Low [2006], Robins and van
der Vaart [2006], Bull and Nickl [2012]. As our results show, this phenomenon
does not exist in sparse regression. A heuristic explanation is perhaps the
observation that under sparsity, $L^2$-norms look more like $L^\infty$-norms in the sense
that the largest coefficient has a significant contribution to the norm, and that
the theory should therefore reflect the $L^\infty$-situation, where separation is also
always necessary (Hoffmann and Nickl [2011]). Mathematically, the reasons why
separation is always necessary are, roughly speaking, the following two: i) the
in many respects natural requirement $k_0 = o(\sqrt{n}/\log p)$ implies that the rate
of sparse adaptive estimation is $o(n^{-1/4})$, so that the construction of general,
dimension-independent, $n^{-1/4}$-width adaptive confidence balls as provided in
Li [1989], Beran and Dümbgen [1998], Baraud [2004] is not relevant for sparse
adaptivity. ii) The approach of estimating the squared $L^2$-risk of an adaptive
estimator proposed in Hoffmann and Lepski [2002], Cai and Low [2006],
Robins and van der Vaart [2006] has an accuracy of $p^{1/4}/\sqrt{n}$ in a general
regression model, which for $p > n$ is of larger order of magnitude than $n^{-1/4}$,
and so is also not useful in high-dimensional models.

Our results therefore show, in particular, that in high-dimensional linear
models sparsely adaptive confidence sets which are honest over the whole
parameter space cannot exist. If one is willing to depart from requiring honesty
over the whole parameter space, our results give weakest possible conditions on
the parts of the parameter space that have to be removed from consideration. The
$\ell^2$-separation conditions we study are indeed weaker than other heuristic
conditions that may come to mind: for instance one may find it intuitive to assume
a lower bound $\gamma_{np}$ on the smallest non-zero entry of $\theta \in B_0(k_1)$. In this case
one has $\|\theta - B_0(k_0)\|^2 \ge (k_1 - k_0)\gamma_{np}^2$. Now even if one restricts to fairly sparse
alternatives $\beta_1 = 1/2$, and noting $p > n$, $k_0 = \mathrm{const}$, one only needs $\gamma_{np}$ of
larger order than $1/\sqrt{n}$ for Theorem 1 to apply. This is of smaller order than
the rate of sparse adaptive estimation even in the null model. Particularly, no
sparse estimator will be able to reliably detect nonzero coefficients of such size.
Indeed, lower bound conditions on $\min_j |\theta_j|$ are not essential to the problem at
hand: rather what is needed is that the $k_1 - k_0$ smallest squared nonzero
coefficients sum to a large enough signal that indicates a nonsparse vector. To
detect such a signal one typically cannot use sparse estimators, but needs
tailor-made procedures and tests, very much in the same vein as in sparse signal
detection (Ingster et al. [2010], Arias-Castro et al. [2011]). These '$\ell^2$-effects' vanish if one
requires coverage only for highly sparse alternatives (/3i — > 0), see the results in 
Section 2.2. 
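The arithmetic behind the comparison with a lower bound on the smallest non-zero entry can be spelled out as follows. This is a sketch under the assumption that $\theta$ has exactly $k_1$ nonzero entries, each of modulus at least $\gamma_{np}$; the constant $c$ is hypothetical and stands for the universal constant in the separation condition $\|\theta - B_0(k_0)\| \ge c\, n^{-1/4}$.

```latex
\begin{align*}
\|\theta - B_0(k_0)\|^2
  &\ge (k_1 - k_0)\,\gamma_{np}^2
   \gtrsim p^{1/2}\,\gamma_{np}^2
   \ge n^{1/2}\,\gamma_{np}^2
  && (\beta_1 = 1/2,\ k_0 = \mathrm{const},\ p > n), \\
n^{1/2}\,\gamma_{np}^2 \ge c^2\, n^{-1/2}
  &\iff \gamma_{np} \ge c\, n^{-1/2}, \\
\Big(\log p \times \frac{k_0}{n}\Big)^{1/2}
  &\sim \Big(\frac{\log p}{n}\Big)^{1/2} \gg n^{-1/2}
  && \text{(rate of sparse estimation in the null model).}
\end{align*}
```

So coefficients of size of order $n^{-1/2}$ already produce an $\ell^2$-separation of order $n^{-1/4}$, although each of them individually is below the estimation rate of the null model.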

Our results concern confidence sets for the parameter vector $\theta$ itself. Often,
instead of $\theta$, inference on $X\theta$ is of interest. Under the usual coherence
assumptions on $X$ that are imposed in the high-dimensional inference literature,
the quantity $\|X\theta\|$ compares to $\|\theta\|$ with high probability, up to universal
constants. Inspection of our proofs then shows that our results apply likewise to
confidence sets for $X\theta$. In particular, if $Z$ is an $m \times p$ matrix, then any honest
confidence set for a predictor $Z\theta$ can be used to solve the testing problem (15)
below as long as $\|Z\theta\| \ge c\|\theta\|$ with high probability, so that lower bounds for
sparse confidence sets for $\theta$ directly carry over to lower bounds for sparse
confidence sets for $Z\theta$ in such situations.



2. Main Results 

We start with the design assumptions we shall be using for our main results. 
For our lower bounds any i.i.d. design with exponential moments is admissible 
- this condition is taken from Ingster et al. [2010]. 

Condition 1. Consider the model (1) with independent and identically
distributed $(X_{ij})$ satisfying $EX_{ij} = 0$, $EX_{ij}^2 = 1$ $\forall i, j$ as well as, for some
$h_0 > 0$,

$$\max_{1 \le j < l \le p} E(\exp(h X_{1j} X_{1l})) = O(1) \quad \forall |h| \le h_0.$$

In fact a fixed design satisfying these assumptions deterministically could also
have been used; we refer to Remark 4.1 in Ingster et al. [2010] which applies
here as well.

For our upper bounds we shall impose the following sub-Gaussian design
assumption. Let $\hat\Sigma := X^T X/n$ and $\Sigma := E\hat\Sigma$. We also define, in slight abuse of
notation, $\|X\theta\|_n^2 := \theta^T \hat\Sigma \theta$.

Condition 2. In the model (1) assume:

a) The matrix $X$ has independent rows, and for each $i \in \{1, \dots, n\}$ and each
$u \in \mathbb{R}^p$ with $u^T \Sigma u \le 1$, the random variable $(Xu)_i$ is sub-Gaussian with constants
$\sigma_0$ and $\kappa_0$:

$$\kappa_0^2 \left(E \exp\{|(Xu)_i|^2/\kappa_0^2\} - 1\right) \le \sigma_0^2, \quad \forall\, u^T \Sigma u \le 1.$$

b) The smallest eigenvalue $\Lambda_{\min}^2$ of $\Sigma$ is non-zero.

For low-dimensional models ($p < n$) we shall strengthen Condition 2 to bounded
and independent design, in order to facilitate some technicalities.

Condition 3. Consider the model (1) with independent and identically
distributed $(X_{ij})$ satisfying $|X_{ij}| \le b$, $EX_{ij} = 0$, $EX_{ij}^2 = 1$ $\forall i, j$ and for some
$b > 0$.

Note that Condition 3 implies Condition 2 with $\Sigma = I$ and universal constants
$\kappa_0, \sigma_0$: we have $(Xu)_i = \sum_{m=1}^p X_{im} u_m$ with mean zero and independent
summands bounded in absolute value by $b|u_m|$, so that by Hoeffding's inequality

$$P(|(Xu)_i| > t) \le 2 e^{-t^2/(2b^2\|u\|_2^2)},$$

thus $(Xu)_i$ is sub-Gaussian and Condition 2 can be checked by integrating tail
probabilities.
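The Hoeffding bound $2\exp(-t^2/(2b^2\|u\|_2^2))$ can be checked numerically in a simple special case. The sketch below assumes a design with independent uniform random signs $X_{im} \in \{-1, +1\}$, which satisfies Condition 3 with $b = 1$; this choice of design and the function names are assumptions made for illustration. The exact tail probability is obtained by enumerating all sign patterns, so the check is only feasible for short vectors $u$.

```python
import itertools
import math


def hoeffding_bound(t, b, u):
    """Right-hand side 2 * exp(-t^2 / (2 * b^2 * ||u||_2^2)) of the inequality."""
    norm_sq = sum(x * x for x in u)
    return 2.0 * math.exp(-t * t / (2.0 * b * b * norm_sq))


def exact_tail_signs(t, u):
    """Exact P(|sum_m X_m u_m| > t) for independent uniform signs X_m (so b = 1)."""
    hits = 0
    for signs in itertools.product((-1.0, 1.0), repeat=len(u)):
        if abs(sum(s * x for s, x in zip(signs, u))) > t:
            hits += 1
    return hits / 2 ** len(u)
```

For $u = (1/2, 1/2, 1/2, 1/2)$, so that $\|u\|_2 = 1$, the exact tail probability stays below the bound for every threshold $t$, as the inequality requires.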



2.1. Conservative Adaptation to Highly Sparse Signals when p > n 

We now examine more closely the setting of Theorem 1 that resembles the most
optimistic hopes behind sparse inference procedures: one wants to adapt to a
highly sparse signal

$$\theta \in B_0(k_0), \quad k_0 \sim p^{1-\beta_0}, \quad 1/2 < \beta_0 \le 1,$$

where $k_0$ grows no faster than $\sqrt{n}$, so that $\theta$ belongs to a model of tractable
dimension in this case. Simultaneously one wishes to be safe against possibly
only moderately sparse alternatives

$$\theta \in B_0(k_1), \quad k_1 \sim p^{1-\beta_1}, \quad 0 < \beta_1 < 1/2.$$

Moreover one wants this in the situation where sparse methods are most useful,
when the number of parameters $p$ exceeds $n$.

Theorem 2. Assume Condition 1 and that $p > n$. For $0 < \beta_1 < 1/2 < \beta_0 \le 1$,
let $p$, $k_0 < k_1$ be as above such that

$$k_0 = o(\sqrt{n}/\log p), \quad \log^3 p = o(n).$$

Suppose for some separation sequence $\rho_{np} \ge 0$ and some $0 < \alpha, \alpha' < 1/3$, the
confidence set $C_{np}$ is both honest over $\Theta(\rho_{np})$ and adapts to sparsity in the sense
of (7), (8). Then necessarily

$$\liminf_{n,p} \frac{\rho_{np}}{n^{-1/4}} > 0.$$

An important question is whether the separation rate $n^{-1/4}$ is sharp in this
setting. The following theorem implies that this is the case at least for general
sub-Gaussian design matrices, and if we restrict to the natural case where $k_1$
is such that consistent estimation in the largest model $B_0(k_1)$ is still possible.
We note that the following theorem holds as well for $k_1$ belonging to the highly
sparse domain ($\beta_1 > 1/2$), and for any $p$.

Theorem 3. Assume Condition 2. For $0 < \beta_1 < \beta_0 \le 1$ let $k_0 < k_1$ be as above
such that

$$k_0 = o(\sqrt{n}/\log p), \quad k_1 = o(n/\log p).$$

Then for every $0 < \alpha, \alpha' < 1$ there exists a sequence $\rho_{np} \ge 0$ satisfying

$$\limsup_{n,p} \frac{\rho_{np}}{n^{-1/4}} < \infty$$

and a level $\alpha$-confidence set $C_{np}$ that is honest over $\Theta(\rho_{np})$ and that adapts to
sparsity in the sense of (7), (8).



2.2. Restricting to Highly Sparse Signals Only 

One may think that complications in the previous subsection arose because one
insisted on coverage over too 'unsparse' parameter spaces ($\beta_1 < 1/2$, possibly
close to 0), and that the problems disappear if one restricts the parameter
space to highly sparse alternatives $B_0(k_1)$ with $k_1 \sim p^{1-\beta_1}$, $\beta_1 > 1/2$. If the
rate of estimation in $B_0(k_1)$ accelerates beyond $n^{-1/4}$ then indeed one can take
advantage of this fact, although separation of $B_0(k_0)$ and $B_0(k_1)$ is still necessary
to obtain sparsely adaptive confidence sets. This is summarised in the following
result. We again consider adaptive honest confidence sets in the sense of (6),
(7), (8), and the following theorem treats all values of $p$ at once.
Theorem 4. Let $1/2 < \beta_1 < \beta_0 \le 1$ and let $k_0 < k_1$ be such that
$k_0 \sim p^{1-\beta_0}$, $k_1 \sim p^{1-\beta_1}$.

A) Assume Condition 1 and that $p$ is such that

$$k_0 = o(\sqrt{n}/\log p), \quad \log^3 p = o(n).$$

Suppose for some separation sequence $\rho_{np} \ge 0$ and some $0 < \alpha, \alpha' < 1/3$, the
confidence set $C_{np}$ is both honest over $\Theta(\rho_{np})$ and adapts to sparsity in the sense
of (7), (8). Then necessarily

$$\liminf_{n,p} \frac{\rho_{np}}{\min\left(\left(\log p \times \frac{k_1}{n}\right)^{1/2}, n^{-1/4}\right)} > 0.$$

B) Assume Condition 2 and that

$$k_0 = o(\sqrt{n}/\log p), \quad k_1 = o(n/\log p).$$

Then for every $0 < \alpha', \alpha < 1$ there exists a sequence $\rho_{np} \ge 0$ satisfying

$$\limsup_{n,p} \frac{\rho_{np}}{\min\left(\left(\log p \times \frac{k_1}{n}\right)^{1/2}, n^{-1/4}\right)} < \infty$$

and a level $\alpha$-confidence set $C_{np}$ that is honest over $\Theta(\rho_{np})$ and that adapts to
sparsity in the sense of (7), (8).

2.3. The case p < n approaching standard nonparametric models



Having seen the effects of weakening the maximal parameter space to be highly
sparse, let us return to the situation $\beta_1 < 1/2 < \beta_0$, and consider the case
$p < n$, not analysed yet. The separation rate $n^{-1/4}$ encountered in Theorems 2
and 3 may seem surprising in light of the $L^2$-theory for adaptive confidence sets
in function estimation, developed in Hoffmann and Lepski [2002],
Robins and van der Vaart [2006], Cai and Low [2006], Bull and Nickl [2012]. The
phenomena of these papers are tied to the 'nonparametric' situation where some
decay on the $\theta_j$'s is imposed so that fitting models of dimension $p < n$ is
adequate. Note that then $p^{1/4}/n^{1/2} < n^{-1/4}$. Although this is not the main setting
relevant in this article we wish to provide here some insights into how the
transition to the nonparametric theory, and eventually to the parametric one, occurs.

To do this we will, as is common in many nonparametric problems, evaluate
performance of confidence procedures relative to parameter spaces that vary in
fixed $\ell^r$-balls of $\mathbb{R}^p$ ($r \ge 1$), uniformly in $p$. Define

$$B_r(M) = \left\{\theta \in \mathbb{R}^p : \|\theta\|_r^r = \sum_{j=1}^p |\theta_j|^r \le M^r\right\}.$$

We now require from any confidence set $C_n$ that, instead of (6), (7), (8), for
some fixed $0 < M < \infty$, $r \in \{1, 2\}$,

$$\liminf_{\min(n,p)\to\infty} \inf_{\theta \in \Theta(\rho_{np}) \cap B_r(M)} P_\theta(\theta \in C) \ge 1 - \alpha, \tag{10}$$

as well as, for some finite constant $L > 0$,

$$\limsup_{\min(n,p)\to\infty} \sup_{\theta \in B_0(k_0) \cap B_r(M)} P_\theta\left(|C|_2^2 > L \log p \times \frac{k_0}{n}\right) \le \alpha', \tag{11}$$

and

$$\limsup_{\min(n,p)\to\infty} \sup_{\theta \in B_0(k_1, \rho_{np}) \cap B_r(M)} P_\theta\left(|C|_2^2 > L \log p \times \frac{k_1}{n}\right) \le \alpha'. \tag{12}$$

We start with a lower bound that applies to $p < n$.

Theorem 5. Assume $p < n$, let $0 < \beta_1 < 1/2 < \beta_0 \le 1$, $0 < M < \infty$, and let
$k_0 < k_1$ be such that $k_0 \sim p^{1-\beta_0}$, $k_1 \sim p^{1-\beta_1}$. Assume Condition 1, and suppose
for some separation sequence $\rho_{np} \ge 0$ and some $0 < \alpha, \alpha' < 1/3$, the confidence
set $C_{np}$ is both honest over $\Theta(\rho_{np}) \cap B_r(M)$, and adapts to sparsity in the sense
of (11), (12). If $r = 2$ then necessarily

$$\liminf_{n,p} \frac{\rho_{np}}{p^{1/4} n^{-1/2}} > 0.$$

Moreover, if $p = O(n^{2/3})$, the same result holds true for $r = 1$.

The question arises whether this lower bound is sharp. A first result confirming
this is the following.

Theorem 6. Assume Condition 3 holds and let $M > 0$. Let $k_0, k_1, \beta_0, \beta_1$ be as
in Theorem 5 and assume further $k_1 = o(n/\log p)$. Then for every $0 < \alpha, \alpha' < 1$
there exists a sequence $\rho_{np} \ge 0$ satisfying

$$\limsup_{n,p} \frac{\rho_{np}}{p^{1/4} n^{-1/2}} < \infty$$

and a level $\alpha$-confidence set $C = C(n, p, b, M)$ that is honest over
$\Theta(\rho_{np}) \cap \{\theta : \|\theta\|_1 \le M\}$ and that adapts to sparsity in the sense of (11), (12)
with $r = 1$.

Theorems 5 and 6 parallel Theorem 1 in Bull and Nickl [2012], where instead
of $p^{1/4}/\sqrt{n}$ the separation rate $n^{-r/(2r+1/2)}$ occurs, $r$ governing the
nonparametric smoothness constraint on the function $f_\theta$ to estimate. This rate
arises precisely from balancing $p = p^*$ such that the squared 'bias' $p^{-2r}$ and the
$P_\theta$-'variance' $p^{1/2}/n$ (of an estimate of $\|f_\theta - f_{\theta'}\|_2^2$, $\theta' \in B_0(k_0)$) are of the same
order. In Bull and Nickl [2012] the usual hypothesis $r > 1/2$ is necessary, which
means that $p^* = n^{1/(2r+1/2)} = o(n^{2/3})$, and which also implies, by the Sobolev
imbedding, sufficient decay of $|\theta_j|$ such that $\|\theta\|_1 = O(1)$, giving intuitions for
Theorem 5 (when $r = 1$) and Theorem 6. Particularly this result approaches,
for $p = \mathrm{const}$, the parametric theory, where the separation rate equals, quite
naturally, $1/\sqrt{n}$. [Note, however, that our results formally do require $p \to \infty$,
possibly arbitrarily slowly.] This is in line with the findings in Pötscher [2009],
Pötscher and Schneider [2011] in the $p < n$ setting, where it is pointed out that
the distributions (asymptotic or not) of a class of specific but commonly used
sparse estimators cannot reliably be used for the construction of confidence sets.
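The balancing argument that produces the nonparametric separation rate can be checked numerically. The sketch below solves $p^{-2r} = p^{1/2}/n$ for $p = p^*$ and returns the resulting rate $(p^*)^{-r}$; the function names are hypothetical, and $n$ and $r$ are treated as real numbers so that no rounding of $p^*$ to an integer is involved.

```python
import math


def balanced_dimension(n: float, r: float) -> float:
    """Dimension p* solving p**(-2r) = sqrt(p) / n, i.e. p* = n**(1 / (2r + 1/2))."""
    return n ** (1.0 / (2.0 * r + 0.5))


def squared_bias(p: float, r: float) -> float:
    """Squared 'bias' term p**(-2r)."""
    return p ** (-2.0 * r)


def variance(p: float, n: float) -> float:
    """'Variance' term sqrt(p) / n."""
    return math.sqrt(p) / n


def separation_rate(n: float, r: float) -> float:
    """Square root of the balanced squared bias, equal to n**(-r / (2r + 1/2))."""
    return balanced_dimension(n, r) ** (-r)
```

For $n = 10^6$ and $r = 1$ the balanced dimension is $n^{2/5}$, which is well below $n^{2/3}$, in line with the remark that $r > 1/2$ forces $p^* = o(n^{2/3})$.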

Note that $\beta_0 > 1/2 > \beta_1$ compares to the condition $s > 2r$ in Theorem 1
in Bull and Nickl [2012], explaining why separation is indeed necessary. If in
contrast we consider adaptation to an only moderately sparse signal with
$\beta_0 < 1/2$ then the phenomenon of Theorem 1(A)(i) in Bull and Nickl [2012] also
appears in the regression situation (with some obvious adaptations), and one
can construct adaptive confidence sets without any removal of parameters for
certain windows $[k_0, k_1]$. Since these mechanisms are not relevant in the most
interesting highly sparse problems investigated here, we do not pursue them
further.

Rather we return to sparsity considerations, which are still of interest when
$n^{2/3} < p < n$, and outside of the usual 'nonparametric' Sobolev-imbedding
(that is, without imposing $\ell^1$-boundedness of the parameter space). The theory
here is more subtle, but in the key case where one does not want to lose if
the 'null' model has a fixed finite dimension, and if one considers models of
dimension at most $\sqrt{n}$, we can generalise the techniques from Bull and Nickl
[2012] and show again that Theorem 5 is sharp.

Theorem 7. Assume Condition 3 holds and let $M > 0$. Let $k_0, k_1, \beta_1$ be as
in Theorem 5 and assume further $\beta_0 = 1$, $k_1 = o(\sqrt{n}/\log p)$. Then for every
$0 < \alpha, \alpha' < 1$ there exists a sequence $\rho_{np} \ge 0$ satisfying

$$\limsup_{n,p} \frac{\rho_{np}}{p^{1/4} n^{-1/2}} < \infty$$

and a level $\alpha$-confidence set $C = C(n, p, b, M)$ that is honest over
$\Theta(\rho_{np}) \cap \{\theta : \|\theta\|_2 \le M\}$ and that adapts to sparsity in the sense of (11), (12)
with $r = 2$.

imsart-generic ver. 2011/11/15 file: sparsereg.tex date: September 10, 2012

A drawback of the above methods is that knowledge of a bound on $M$ is
required, which is not estimable. Knowing a bound on $M$ cannot intrinsically
be circumvented without imposing other qualitative restrictions on $\theta$, unless
somewhat artificially by 'undersmoothing'. We refer to Bull and Nickl [2012]
for discussion of these matters and of how to deal with them.



2.4. Towards constructive procedures



An important question is whether the existence results for sparse confidence
sets obtained in the previous sections suggest concrete constructive confidence
procedures which one could use in practice, and which work over maximal
parameter spaces $\Theta(\rho_{np})$. While a general answer to this question is beyond the
scope of the present paper, we wish to sketch some ideas that transpire from
our proofs, where we concentrate on the case $p > n$ with coverage required over
moderately sparse signals (so the setting of Theorems 1 and 3).

The proof of Theorem 3 is based on first solving the testing problem
$H_0 : \theta \in B_0(k_0)$ vs. $H_1 : \theta \in B_0(k_1, \rho)$, and then centering the confidence set at a
standard sparse estimator (such as the Lasso), with radius of the confidence ball
adjusted to the sparsity level selected by the test. See Section 3.2 below. The
testing problem is solved by considering the statistics

$$t_n(\theta') = \frac{1}{\sqrt{2n}} \sum_{i=1}^n \left[(Y_i - (X\theta')_i)^2 - 1\right], \quad T_n = \min_{\theta' \in B_0(k_0)} |t_n(\theta')|,$$

and accepting $H_0$ if

$$\Psi_n = 1\{T_n > u_\gamma\} \tag{13}$$

equals zero, where $u_\gamma$ is a suitable quantile of a Chi-squared distribution. While
the computation of $t_n(\theta')$ is straightforward, computation of $T_n$ involves a
combinatorial minimisation problem, and it is natural to look for a convex
relaxation of it, such as is standard in the construction of sparse estimators
(see (33) below). In practice, one could thus start with a finite family of
candidate sparsity levels $k_m$, $m = 0, \dots, N$, select $k_m$ in an iterative procedure
by a suite of the above tests, and then proceed as in Section 3.2 below to
construct confidence balls around one's favourite sparse estimator (such as the
Lasso). A sharp choice of the constant $L'$ in the radius requires some
probabilistic analysis of the distribution of the sparse estimator one is using.
[For instance as in Corollary 2 below, tracking the constants more carefully.]

In the case where one considers a maximal model that is itself highly sparse one
can attempt to adapt the higher criticism tests of Arias-Castro et al. [2011],
Ingster et al. [2010] to the composite situation and proceed similarly.

Finally, the variance of $\varepsilon$ is usually unknown and so needs to be estimated. This
may in principle affect the size of the separation sequences $\rho$ (see the discussion
in Ingster et al. [2010]), but under the general hypothesis $k_1 = o(n/\log p)$ does
not affect the theory drastically.
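The test statistic of this section can be sketched in a few lines, reading it as the centred and normalised residual sum of squares $t_n(\theta') = (2n)^{-1/2}\sum_i [(Y_i - (X\theta')_i)^2 - 1]$. Two assumptions are made loudly here: the minimisation over all of $B_0(k_0)$ is replaced by a minimum over a finite, user-supplied list of candidate vectors (the exact problem is combinatorial, as noted above), and the threshold `u_gamma` is passed in by the caller rather than derived from a Chi-squared quantile. All function names are hypothetical.

```python
import math


def t_n(Y, X, theta):
    """Centred, normalised residual sum of squares for the candidate theta."""
    n = len(Y)
    total = 0.0
    for y_i, row in zip(Y, X):
        fitted = sum(x * t for x, t in zip(row, theta))
        total += (y_i - fitted) ** 2 - 1.0
    return total / math.sqrt(2.0 * n)


def T_n(Y, X, candidates):
    """Minimum of |t_n| over a finite candidate list.

    Assumption: the list stands in for the minimisation over B_0(k_0).
    """
    return min(abs(t_n(Y, X, theta)) for theta in candidates)


def reject(Y, X, candidates, u_gamma):
    """Return 1 to reject H_0 (T_n exceeds u_gamma) and 0 to accept it."""
    return 1 if T_n(Y, X, candidates) > u_gamma else 0
```

If a candidate reproduces the data up to residuals of unit size, its statistic is zero and the null hypothesis is accepted; if no sparse candidate fits, the minimum stays large and the test rejects.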

3. Proofs 

3.1. Proof of Lower Bounds: Detection Boundaries 

We now prove Theorems 2, 4A and 5 in a unified fashion. The necessity part
of Theorem 1 follows from Theorem 2 since any i.i.d. Gaussian matrix satisfies
Condition 1, and since the assumption on $k_1$ implies the growth condition
$\log^3 p = o(n)$. How to accommodate the $\ell^r$-norm restrictions of Theorem 5 is
discussed at the end of the proof. Except for these $\ell^r$-norm restrictions,
Theorems 2 and 5 can be joined into a single statement with separation sequence
$\min(p^{1/4} n^{-1/2}, n^{-1/4})$, valid for every $p$. We thus have to consider, for all
values of $p$, two cases: the moderately sparse case $\beta_1 < 1/2$ with separation lower
bound $\min(p^{1/4} n^{-1/2}, n^{-1/4})$, and the highly sparse case $\beta_1 > 1/2$ with
separation lower bound $\min((\log p \times (k_1/n))^{1/2}, n^{-1/4})$. Denote thus by $\rho^* = \rho^*_{np}$
either $\min((\log p \times (k_1/n))^{1/2}, n^{-1/4})$ or $\min(p^{1/4} n^{-1/2}, n^{-1/4})$, depending on
the case considered.
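The two candidate separation rates can be evaluated side by side with a minimal sketch. Constants are ignored, and the function name and argument order are assumptions made for illustration; the boundary value $\beta_1 = 1/2$ is rejected because it is not covered by either of the two cases.

```python
import math


def rho_star(n: int, p: int, k1: int, beta1: float) -> float:
    """Separation lower bound, up to constants, in the two cases of the text."""
    quarter_rate = n ** -0.25
    if beta1 < 0.5:
        # Moderately sparse case: min(p^(1/4) / sqrt(n), n^(-1/4)).
        return min(p ** 0.25 / math.sqrt(n), quarter_rate)
    if beta1 > 0.5:
        # Highly sparse case: min(sqrt(log p * k1 / n), n^(-1/4)).
        return min(math.sqrt(math.log(p) * k1 / n), quarter_rate)
    raise ValueError("beta1 = 1/2 is not covered by the two cases")
```

When $p$ exceeds $n$ in the moderately sparse case the minimum is attained by $n^{-1/4}$, whereas for $p$ below $n$ the term $p^{1/4}/\sqrt{n}$ takes over.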

The main idea of the proof follows the mechanism introduced in Hoffmann and
Nickl [2011] which shows that adaptive confidence sets implicitly solve certain
testing problems, so that in turn it suffices to disprove the existence of
consistent tests for these problems, for which we adapt results by Ingster et al.
[2010] to the present, composite situation. Suppose thus by way of contradiction
that $C$ is a confidence set as in the relevant theorems, for some sequence
$\rho = \rho_{np}$ such that

$$\liminf_{n,p} \frac{\rho_{np}}{\rho^*_{np}} = 0.$$

By passing to a subsequence we may replace the liminf by a proper limit, and
we shall in what follows only argue along this subsequence, still denoted by $n$.
We claim that we can then find a further sequence $\bar\rho_{np} = \bar\rho$,
$\rho^* \ge \bar\rho_{np} \ge \rho_{np}$, such that

$$\left(\log p \times \frac{k_0}{n}\right)^{1/2} = o(\bar\rho_{np}), \quad \bar\rho_{np} = o(\rho^*_{np}), \tag{14}$$

that is, $\bar\rho$ can be taken to be squeezed between the rate of adaptive estimation
in the submodel $B_0(k_0)$ and the separation rate $\rho^*$ that we want to establish
as a lower bound. To check that this is indeed possible we need to verify that
$(\log p \times (k_0/n))^{1/2}$ is of smaller order than any of the three terms

$$\left(\log p \times \frac{k_1}{n}\right)^{1/2}, \quad \frac{p^{1/4}}{\sqrt{n}}, \quad n^{-1/4}$$

appearing in $\rho^*$. This is obvious for the first in view of the definition of
$k_0, k_1$ and since $\beta_1 < \beta_0$; follows for the second from $\beta_0 > 1/2$; and follows for
the third from our assumption $k_0 = o(\sqrt{n}/\log p)$ (automatically verified in
Theorem 5 as $p < n$, $\beta_0 > 1/2$).
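The three order comparisons can be written out explicitly. The display below is a sketch of the verification, working with squared quantities and using $k_0 \sim p^{1-\beta_0}$, $k_1 \sim p^{1-\beta_1}$ and $p \to \infty$; constants hidden in the $\sim$ relations are ignored.

```latex
\begin{align*}
\frac{\log p \times k_0/n}{\log p \times k_1/n}
  &= \frac{k_0}{k_1} \sim p^{\beta_1 - \beta_0} \to 0
  && \text{since } \beta_1 < \beta_0, \\
\frac{\log p \times k_0/n}{p^{1/2}/n}
  &\sim p^{1/2 - \beta_0} \log p \to 0
  && \text{since } \beta_0 > 1/2, \\
\frac{\log p \times k_0/n}{n^{-1/2}}
  &= \frac{k_0 \log p}{\sqrt{n}} \to 0
  && \text{since } k_0 = o(\sqrt{n}/\log p).
\end{align*}
```

Taking square roots gives the claimed ordering of the estimation rate in the submodel against each of the three terms.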

For such a sequence $\bar\rho$ consider testing

$$H_0 : \theta = 0 \quad \text{vs.} \quad H_1 : \theta \in B_0(k_1, \bar\rho).$$

Using the confidence set $C$ we can test $H_0$ by

$$\Psi = 1\{C \cap H_1 \ne \emptyset\}$$

so we reject $H_0$ if $C$ contains any of the alternatives. The type two errors satisfy

$$\sup_{\theta \in H_1} E_\theta(1 - \Psi) = \sup_{\theta \in H_1} P_\theta(C \cap H_1 = \emptyset) \le \sup_{\theta \in H_1} P_\theta(\theta \notin C) \le \alpha + o(1)$$

by coverage of $C$ over $H_1 \subset \Theta(\rho)$ (recall $\bar\rho \ge \rho$). For the type one errors we
have, again by coverage, since $0 \in B_0(k_0)$ for any $k_0$, using adaptivity (7) and
(14), that

$$E_0\Psi = P_0(C \cap H_1 \ne \emptyset) \le P_0(0 \in C, |C|_2 \ge \bar\rho) + \alpha + o(1) \le \alpha' + \alpha + o(1).$$

We conclude from $\max(\alpha', \alpha) < 1/3$ that

$$E_0\Psi + \sup_{\theta \in H_1} E_\theta(1 - \Psi) \le \alpha' + 2\alpha + o(1) < 1 + o(1). \tag{15}$$

On the other hand we now show

$$\liminf_{n,p} \inf_\Psi \left(E_0\Psi + \sup_{\theta \in H_1} E_\theta(1 - \Psi)\right) \ge 1, \tag{16}$$

a contradiction, so that

$$\liminf_{n,p} \frac{\rho_{np}}{\rho^*_{np}} > 0$$

necessarily must be true. Our argument proceeds by deriving (16) from Theorem
4.1 in Ingster et al. [2010]. Let

$$0 < c < 1, \quad b = \frac{\bar\rho}{c\sqrt{k_1}}, \quad h = \frac{c k_1}{p},$$

and note that

$$b^2 p h = \frac{\bar\rho^2}{c} > \bar\rho^2, \quad b^2 k_0 = o(b^2 p h) \tag{17}$$

using that $k_0 = o(k_1)$. Consider a product prior $\pi$ on $\theta$ with marginal coefficients

$$\theta_j = b\varepsilon_j, \quad j = 1, \dots, p,$$

where the $\varepsilon_j$ are i.i.d. with $P(\varepsilon_j = 0) = 1 - h$, $P(\varepsilon_j = 1) = P(\varepsilon_j = -1) = h/2$.
We show that this prior asymptotically concentrates on our alternative space
$H_1 = B_0(k_1, \bar\rho)$. Let $Z_j = \varepsilon_j^2$ and denote by $Z_{(j)}$ the corresponding order
statistics (counting ties in any order, for instance ranking numerically by
dimension), then for any $\delta > 0$ and $n$ large enough, using (17),

$$\pi\left(\|\theta - B_0(k_0)\|^2 < (1 + \delta)\bar\rho^2\right)$$



< 



< 



which by the proof of Lemma 5.1 in Ingster et al. [2010] converges to 1 as
$\min(n, p) \to \infty$. Moreover that lemma also contains the proof that
$\pi(\theta \in B_0(k_1)) \to 1$ (identifying $k$ there with our $k_1$), which thus implies
$\pi(B_0(k_1, \bar\rho)) \to 1$ as $\min(n, p) \to \infty$. The testing lower bound based on this
prior, derived in Theorem 4.1 in Ingster et al. [2010] (cf. particularly p. 1487),
then implies (16), which is the desired contradiction. Finally, for Theorem 5,
note that the above implies immediately that $\theta \sim \pi$ asymptotically concentrates
on any fixed $\ell^2$-ball. Moreover, $E_\pi\|\theta\|_1 = bph = o(1)$ under the hypotheses of
Theorem 5 when $p = O(n^{2/3})$, and likewise $\mathrm{Var}_\pi(\|\theta\|_1) = b^2 p h$, so we conclude
as in the proof of Lemma 5.1 in Ingster et al. [2010] that the prior
asymptotically concentrates on any fixed $\ell^1$-ball in this situation.

3.2. Proofs of Upper Bounds: Composite Testing Problems 

We follow Hoffmann and Nickl [2011] and Bull and Nickl [2012] in constructing
upper bounds. The main mechanism behind the proofs for upper bounds is to
solve the composite testing problem

$$H_0: \theta \in B_0(k_0) \quad \text{vs.} \quad H_1: \theta \in B_0(k_1,\rho) \tag{18}$$

under the parameter constellations of $k_0, k_1, p, \rho, n$ relevant in Theorems 3, 4B,
6 and 7. Once a minimax test $\Psi_n$ is available for which type-one and type-two
errors

$$\sup_{\theta \in H_0} E_\theta\Psi_n + \sup_{\theta \in H_1} E_\theta(1-\Psi_n) \le \gamma \tag{19}$$

can be controlled, for $n$ large enough, at any level $\gamma > 0$, one simply centers the
confidence set at a sparse estimator with radius the rate of estimation at the
sparsity level selected by the test, seen as follows:

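Take $\tilde\theta$ to be the estimator from (33) below with $\lambda$ chosen as in Lemma 4, and
let, for $0 < L' < \infty$,

$$C_n = \begin{cases} \{\theta : \|\theta - \tilde\theta\|_2 \le L'\sqrt{\log p \times k_0/n}\} & \text{if } \Psi_n = 0, \\ \{\theta : \|\theta - \tilde\theta\|_2 \le L'\sqrt{\log p \times k_1/n}\} & \text{if } \Psi_n = 1. \end{cases}$$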



Assuming (19) we now prove that $C_n$ is honest for $B_0(k_0) \cup B_0(k_1,\rho_{np})$ if we
choose $L'$ large enough. For $\theta \in B_0(k_0)$ we have from Corollary 2 below, for $L'$
large,

$$\inf_{\theta \in B_0(k_0)} P_\theta\{\theta \in C_n\} \ge 1 - \sup_{\theta \in B_0(k_0)} P_\theta\Big\{\|\tilde\theta - \theta\|_2 > L'\sqrt{\log p \times k_0/n}\Big\} \to 1$$

as $n \to \infty$. When $\theta \in B_0(k_1,\rho_{np})$, we have that $P_\theta\{\theta \in C_n\}$ exceeds

$$1 - \sup_{\theta \in B_0(k_1)} P_\theta\Big\{\|\tilde\theta - \theta\|_2 > L'\sqrt{\log p \times k_1/n}\Big\} - \sup_{\theta \in B_0(k_1,\rho_{np})} P_\theta\{\Psi_n = 0\}.$$

The first subtracted term converges to zero for $L'$ large enough, as before. The
second subtracted term can be made less than $\gamma = \alpha$, using (19). This proves
that $C_n$ is honest. We now turn to sparse adaptivity of $C_n$: by the definition
of $C_n$ we always have $|C_n| \le L'\sqrt{\log p \times k_1/n}$, so the case $\theta \in B_0(k_1,\rho_{np})$ is
proved. If $\theta \in B_0(k_0)$ then

$$P_\theta\Big\{|C_n| > L'\sqrt{\log p \times k_0/n}\Big\} = P_\theta\{\Psi_n = 1\} \le \alpha',$$

by the bound on the type-one errors of the test, completing the proof of existence
of an adaptive confidence set, assuming (19).
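
The construction above can be illustrated by a minimal Python sketch. It assumes that the outcome of the test and the sparse estimator are already available; the function names `confidence_radius` and `in_confidence_set` are hypothetical and introduced only for illustration, and the constant `L` stands for $L'$.

```python
import math

def confidence_radius(test_rejects: bool, k0: int, k1: int,
                      n: int, p: int, L: float) -> float:
    """Radius L * sqrt(log(p) * k / n), with k = k1 if the test rejects H_0
    (Psi_n = 1) and k = k0 otherwise (Psi_n = 0)."""
    k = k1 if test_rejects else k0
    return L * math.sqrt(math.log(p) * k / n)

def in_confidence_set(theta, center, radius: float) -> bool:
    """Check whether theta lies in the l2-ball of the given radius
    centered at the (already computed) sparse estimator."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta, center)))
    return dist <= radius
```

The diameter of the set is thus random only through the test: it shrinks to the $k_0$-rate exactly when the test accepts the sparser hypothesis.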



3.2.1. The $n^{-1/4}$-regime: Proof of Theorem 3

Throughout this subsection we impose the assumptions from Theorem 3, with
$\rho_{np} \ge L_0 n^{-1/4}$ for some $L_0$ large enough that we will choose below, and we
wish to solve the testing problem (19) with this choice of $\rho$, for any level $\gamma > 0$.
Define

$$t_n(\theta') = \frac{1}{\sqrt{2n}} \sum_{i=1}^n \big[(Y_i - (X\theta')_i)^2 - 1\big], \qquad T_n = \inf_{\theta' \in B_0(k_0)} |t_n(\theta')|$$
and the test

$$\Psi_n = 1\{T_n > u_\gamma\} \tag{20}$$

where $u_\gamma$ is a suitable fixed quantile constant such that, for every $\theta \in B_0(k_0)$,

$$E_\theta\Psi_n = P_\theta\{T_n > u_\gamma\} \le P_\theta(|t_n(\theta)| > u_\gamma) = P\Big(\frac{1}{\sqrt{2n}}\Big|\sum_{i=1}^n (\varepsilon_i^2 - 1)\Big| > u_\gamma\Big) \le \gamma. \tag{21}$$



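For the type-two errors $\theta \in H_1$, let $\theta^*$ be a minimiser in $T_n$ (if the infimum is
not attained the argument below requires obvious modifications). Then

$$\begin{aligned} t_n(\theta^*) &= \frac{1}{\sqrt{2n}} \sum_{i=1}^n \big[(Y_i - (X\theta^*)_i)^2 - 1\big] \\ &= \frac{1}{\sqrt{2n}} \sum_{i=1}^n \big[(Y_i - (X\theta)_i + (X\theta)_i - (X\theta^*)_i)^2 - 1\big] \\ &= \frac{1}{\sqrt{2n}} \Big\{\sum_{i=1}^n \big[(Y_i - (X\theta)_i)^2 - 1\big] + 2\langle Y - X\theta, X(\theta - \theta^*)\rangle + \|X(\theta - \theta^*)\|_2^2\Big\} \end{aligned}$$

so the type two errors $E_\theta(1 - \Psi_n)$ are controlled by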



3* Ml 2 



^[(Y - (X^O 2 - 1] + 2(Y- X6,X{6- 9*)) + \\X(9-9*)\\ 
X(9~9*)\\ 2 ^ \ 



< y2nu 



E( 

i=l 



> 



(22) 



Pe ( \2{e,X{9-9*))\ > l|X(g o r)l|2 -Vi^ 



Since $\theta^* \in B_0(k_0)$ and $k_0 = o(n/\log p)$ we have, from Corollary 1 below with
$t = k_1 \log p$ that, for $n$ large enough and with probability at least $1 - 4e^{-k_1 \log p}$,

$$\|X(\theta - \theta^*)\|_2^2 \ge \inf_{\theta' \in H_0} \|X(\theta - \theta')\|_2^2 \ge c(\Lambda_{\min})\, n\rho^2 \ge L'\sqrt{n} \tag{23}$$

for every $L' > 0$ if we choose $L_0$ large enough. The probability in the last but
one line of the display (22) is thus bounded by



Po 



> Vn(L' 



4e 



-fei logp 



which, for n large enough, can be made as small as desired by choosing L' > 4it 7 , 
as in (21). Likewise the estimate (23) implies that the last probability in the
display (22) is bounded, for $n$ large enough, by $4e^{-k_1 \log p}$ plus



Pe (\2(s,X(6-e*))\ > 



\X( 



< Pe ( sup 

V&H \\X 



2\{e,X{6-e>)\) ^ 1 



Oil 2 > 4, 
which converges to zero for large enough separation constant $L_0$, proved in
Lemma 2 below (noting that the exponential bound there is independent of $X$,
using Corollary 1 to lower bound $\|X(\theta - \theta')\|$, and that $\sqrt{k_0 \log p/n} = o(n^{-1/4})$).



3.2.2. The $\sqrt{(k_0/n)\log p}$-regime: Proof of Theorem 4B

Throughout this subsection we impose the assumptions from Theorem 4, with
$\rho_{np}$ exceeding $L_0\sqrt{(k_0/n)\log p}$ for some $L_0$ large enough that we will choose
below (the $n^{-1/4}$-regime was treated already in Theorem 3). We wish to solve
the testing problem (19) with this choice of $\rho$, for any level $\gamma > 0$. In this regime
a simple plug-in test approach works. Let $\tilde\theta$ be the estimator from (33) below
with $\lambda$ chosen as in Corollary 2 below, and define the test statistic

T n = inf ||0-0|| 2 , *„ = l(T„>Clogp^ 

for $D$ to be chosen. The type-one errors satisfy, uniformly in $\theta \in H_0$, for $D$ large
enough

E e ^ n <Pe (\\9-8\\ 2 >D\ogp 1 ^] -> 



as $\min(p,n) \to \infty$, by Corollary 2. Likewise, under the alternatives $\theta \in B_0(k_1,\rho)$
we have, for some $\theta^* \in B_0(k_0)$, by the triangle inequality,



E e (l - *„) = P e I \\6- < Clog/ 1 



<P e \\\8-9\\>\\9* -9\\- x lClogp- 1 



n 



<Pg[\\9-9\\ 2 >(L Q -C)logp^- 



11 



for $L_0$ large enough, again by Corollary 2 below.

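
The plug-in statistic can be computed explicitly. The following Python sketch assumes $T_n$ is the squared Euclidean distance from the estimator to the set of $k_0$-sparse vectors; that infimum is attained by keeping the $k_0$ largest-magnitude coordinates of the estimator, so $T_n$ is the sum of the remaining squared coordinates. The function names are hypothetical, and the threshold is passed in as a parameter rather than fixed.

```python
def squared_distance_to_sparse(theta_hat, k0: int) -> float:
    """inf over k0-sparse theta of ||theta_hat - theta||_2^2.
    The closest k0-sparse vector keeps the k0 largest |entries| of theta_hat,
    so the distance is the sum of the remaining squared entries."""
    squares = sorted((x * x for x in theta_hat), reverse=True)
    return sum(squares[k0:])

def plug_in_test(theta_hat, k0: int, threshold: float) -> int:
    """Return 1 (reject H_0) if the statistic exceeds the supplied threshold,
    and 0 otherwise. The threshold is an input, not a value from the paper."""
    return 1 if squared_distance_to_sparse(theta_hat, k0) > threshold else 0
```

For example, with estimator `[3.0, -1.0, 0.5, 0.0]` and `k0 = 1` the statistic is `1.0 + 0.25 = 1.25`.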
3.2.3. The $p^{1/4}/\sqrt{n}$-regime: Proof of Theorems 6, 7

Throughout this subsection we impose the assumptions from Theorem 6 and 7,
with $\rho_{np} \ge L_0 p^{1/4}/\sqrt{n}$ for some $L_0$ large enough that we will choose below, and
we wish to solve the testing problem (19) with this choice of $\rho$, for any level
$\gamma > 0$. For $\theta' \in \mathbb{R}^p$ we define the U-statistic



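$$U_n(\theta') = \frac{2}{n(n-1)} \sum_{i<k} \sum_{j=1}^p (Y_i X_{ij} - \theta'_j)(Y_k X_{kj} - \theta'_j),$$

which equals $\|n^{-1}X^TY - \theta'\|^2$ with the diagonal terms ($i = k$) removed. We
note

$$\frac{1}{n} E_\theta X^TY = E_\theta\Big(\frac{1}{n}X^TX\Big)\theta = \theta, \qquad E_\theta Y_i X_{ij} = \theta_j \tag{24}$$

and thus

$$E_\theta U_n(\theta') = \|\theta - \theta'\|^2,$$

so this U-statistic estimates the squared $L^2$-distance of $\theta'$ to the unknown $\theta$.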
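
A small Python sketch may help to see the "diagonal terms removed" structure. It assumes the U-statistic has the pairwise form $\frac{2}{n(n-1)}\sum_{i<k}\sum_j (Y_iX_{ij}-\theta'_j)(Y_kX_{kj}-\theta'_j)$, which is a reading of the description above rather than a formula quoted from the source; both function names are hypothetical. The second function computes the same quantity as the full squared norm of the column sums minus the diagonal terms.

```python
def u_statistic(Y, X, theta_prime):
    """Pairwise (i < k) form of the assumed U-statistic."""
    n, p = len(Y), len(theta_prime)
    v = [[Y[i] * X[i][j] - theta_prime[j] for j in range(p)] for i in range(n)]
    total = 0.0
    for i in range(n):
        for k in range(i + 1, n):
            total += sum(v[i][j] * v[k][j] for j in range(p))
    return 2.0 * total / (n * (n - 1))

def u_statistic_fast(Y, X, theta_prime):
    """Same quantity: squared norm of the summed vectors minus the
    diagonal (i = k) terms, divided by n(n-1)."""
    n, p = len(Y), len(theta_prime)
    v = [[Y[i] * X[i][j] - theta_prime[j] for j in range(p)] for i in range(n)]
    col_sums = [sum(v[i][j] for i in range(n)) for j in range(p)]
    full = sum(s * s for s in col_sums)
    diag = sum(v[i][j] ** 2 for i in range(n) for j in range(p))
    return (full - diag) / (n * (n - 1))
```

The second form costs $O(np)$ operations instead of $O(n^2p)$, which matters when the statistic has to be evaluated at many candidate vectors.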
Letting 

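$$T_n = \inf_{\theta' \in B_0(k_0)} |U_n(\theta')|$$

we define the test

$$\Psi_n = 1\Big\{T_n \ge u_\gamma \frac{\sqrt{p}}{n}\Big\}$$

for $u_\gamma$ quantile constants specified below.

For type-one errors we have, uniformly in $H_0$, by Chebyshev's inequality

$$E_\theta\Psi_n = P_\theta\Big(T_n \ge u_\gamma\frac{\sqrt{p}}{n}\Big) \le P_\theta\Big(|U_n(\theta)| \ge u_\gamma\frac{\sqrt{p}}{n}\Big) \le \frac{n^2\,\mathrm{Var}_\theta(U_n(\theta))}{u_\gamma^2\, p}. \tag{25}$$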

Under $P_\theta$ the U-statistic $U_n(\theta)$ is fully centered (cf. (24)), and by standard
U-statistic arguments the variance can be bounded by $\mathrm{Var}_\theta(U_n(\theta)) \le Dp/n^2$
for some constant $D$ depending only on $M$ and $\max_{1 \le j \le p} EX_{1j}^4 \le b^4$, see, for
instance, display (6.6) in Ingster et al. [2010] and the arguments preceding it.
We can thus choose $u_\gamma = u_\gamma(M, b)$ to control the type-one errors in (25).

We now turn to the type-two errors: Let $\theta^*$ be a minimiser in $T_n$, then $U_n(\theta^*)$
has Hoeffding decomposition

$$U_n(\theta^*) = U_n(\theta) + 2L_n(\theta^*) + \|\theta^* - \theta\|^2$$

with linear term

71 P 



n p 



i=l j=l 

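We can thus bound the type two errors $E_\theta(1 - \Psi_n)$ as follows:

$$\begin{aligned} P_\theta\Big(T_n < u_\gamma\frac{\sqrt{p}}{n}\Big) &\le P_\theta\Big(|U_n(\theta)| + 2|L_n(\theta^*)| > \|\theta - \theta^*\|^2 - u_\gamma\frac{\sqrt{p}}{n}\Big) \\ &\le P_\theta\Big(|U_n(\theta)| > \frac{\|\theta - \theta^*\|^2}{2} - u_\gamma\frac{\sqrt{p}}{2n}\Big) + P_\theta\Big(|L_n(\theta^*)| > \frac{\|\theta - \theta^*\|^2}{4} - u_\gamma\frac{\sqrt{p}}{4n}\Big). \end{aligned}$$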
By hypothesis on $\rho_{np}$ we can find $L_0$ large enough such that

$$\|\theta - \theta^*\|^2 \ge \inf_{\theta' \in H_0} \|\theta - \theta'\|^2 \ge L\frac{\sqrt{p}}{n}$$

for any $L > 0$, so that the first probability in the previous display can be
bounded by



U n (6)\ > u 



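which involves a fully centered U-statistic and can thus be dealt with as in the
case of type-one errors. The critical term is the linear term, which, by the above
estimate on $\|\theta - \theta^*\|$, is less than or equal to

The process $L_n(\theta')$ can be written as

$$\begin{aligned} \langle \theta - n^{-1}X^TY, \theta - \theta'\rangle &= \langle \theta - n^{-1}X^TX\theta, \theta - \theta'\rangle - \langle n^{-1}X^T\varepsilon, \theta - \theta'\rangle \\ &= \frac{1}{n}\langle (E_\theta X^TX - X^TX)\theta, \theta - \theta'\rangle - \frac{1}{n}\langle \varepsilon, X(\theta - \theta')\rangle \end{aligned}$$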

and we can thus bound the last probability by 

-Wfd'M 1 \ / I r( 2 h 



To show that the probability involving the second process approaches zero it 
suffices to show that 

\e T X{9-9')/n\ 

' J* > tttt (27) 



'eH 



\\X(9-9>)\\ 2 /n 




converges to zero, using that $\sup_{v \in B_0(k_1)} \|Xv\|_2^2/(n\|v\|_2^2) \le \Lambda$ for some $0 < \Lambda < \infty$,
on events of probability approaching one, by Lemma 1, using also $k_1 = o(n/\log p)$.
By Lemma 2 this last probability approaches zero as $\min(n,p) \to \infty$,
for $L_0$ large enough, noting that the lower bound on $R_t$ there is satisfied for
our separation sequence $\rho_{np}$, by Corollary 1 and since $(k_0/n)\log p = o(p^{1/2}/n)$
in view of $\beta_0 > 1/2$. Likewise, using the preceding arguments with Lemma 3
instead of Lemma 2, the probability involving the first process also converges
to zero, which completes the proof.



3.3. Remaining Proofs 

Lemma 1. Assume Condition 2a. Then for some constants $a$ and $\kappa$ depending
only on $\sigma_0$ and $\kappa_0$, and for all $\theta \in \mathbb{R}^p$, all $k \in \{1, \dots, p\}$ and all $t > 0$, it holds
that

$$P\Bigg(\sup_{\theta' \in B_0(k),\ (\theta'-\theta)^T\Sigma(\theta'-\theta) \ne 0} \Bigg|\frac{(\theta'-\theta)^T\hat\Sigma(\theta'-\theta)}{(\theta'-\theta)^T\Sigma(\theta'-\theta)} - 1\Bigg| \ge 4a\sqrt{\frac{t + (k+1)\log(25p)}{n}} + 4\kappa\frac{t + (k+1)\log(25p)}{n}\Bigg) \le 4\exp[-t].$$

Corollary 1. Let $X$ satisfy Conditions 2a and 2b. Let $a$ and $\kappa$ be defined as in
Lemma 1. Suppose that $k \in \{1, \dots, p\}$ and $t > 0$ are such that



1 



/ 8(fc + l)log(25p) v 8A < 
\ n n J ~ \4(a V k) 



A 1 



Then for all $\theta \in \mathbb{R}^p$



9' - 9) T t(9' -6)>-\\e'- 9\\ 2 A 2 min Vfl'e B Q (k) ) > 1 - 4exp[-i] 



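Proof of Lemma 1. Fix a set $S \subset \{1, \dots, p\}$ with cardinality $|S| = k$. Let
$\mathbb{R}^p_S := \{\theta \in \mathbb{R}^p : \theta_j = 0\ \forall j \notin S\}$. We will show that

$$P\Bigg(\sup_{\theta' \in \mathbb{R}^p_S,\ (\theta'-\theta)^T\Sigma(\theta'-\theta) \ne 0} \Bigg|\frac{(\theta'-\theta)^T\hat\Sigma(\theta'-\theta)}{(\theta'-\theta)^T\Sigma(\theta'-\theta)} - 1\Bigg| \ge 4a\sqrt{\frac{t + 2(k+1)\log 5}{n}} + 4\kappa\frac{t + 2(k+1)\log 5}{n}\Bigg) \le 4\exp[-t].$$

Since there are $\binom{p}{k} \le p^k$ sets $S$ of cardinality $k$, the result then follows from the
union bound.

We now show that without loss of generality, one may assume $\theta = 0$, provided
in the end, one replaces $k$ by $k+1$, $p$ by $p+1$, and adds a column to the matrix
$X$. To see this, define for $\theta' \in \mathbb{R}^p_S$, $\bar\theta' := \theta' - \theta_S$ and $\bar X := (X, X\theta_{S^c})$. Here,
$\theta_{j,S} = \theta_j 1\{j \in S\}$, $j = 1, \dots, p$ and $\theta_{S^c} = \theta - \theta_S$. Moreover, define
$(\tilde\theta')^T := ((\bar\theta')^T, -1)$. Then $\bar X$ is an $n \times (p+1)$-matrix, and $\tilde\theta' \in \mathbb{R}^{p+1}_{\bar S}$, where $\bar S := S \cup \{p+1\}$.
Thus $\tilde\theta'$ is $(k+1)$-sparse (i.e. $|\bar S| = k+1$). Moreover $X(\theta' - \theta) = \bar X\tilde\theta'$.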

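The above argument means we only have to show

$$P\Bigg(\sup_{\theta' \in \mathbb{R}^p_S,\ (\theta')^T\Sigma\theta' \ne 0} \Bigg|\frac{(\theta')^T\hat\Sigma\theta'}{(\theta')^T\Sigma\theta'} - 1\Bigg| \ge 4a\sqrt{\frac{t + 2k\log 5}{n}} + 4\kappa\frac{t + 2k\log 5}{n}\Bigg) \le 4\exp[-t].$$

But this is the same as showing

$$P\Bigg(\sup_{\theta' \in B_S} |(\theta')^T\Phi\theta'| \ge 4a\sqrt{\frac{t + 2k\log 5}{n}} + 4\kappa\frac{t + 2k\log 5}{n}\Bigg) \le 4\exp[-t], \tag{28}$$

where $B_S := \{\theta' \in \mathbb{R}^p_S : (\theta')^T\Sigma\theta' \le 1\}$ and $\Phi := \hat\Sigma - \Sigma$.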

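We use the notation $\|u\|_\Sigma^2 := u^T\Sigma u$, $u \in \mathbb{R}^p$, and we let, for $0 < \delta < 1$,
$\{X\theta^l_S\}_{l=1}^{N(\delta)}$ be a minimal $\delta$-covering of $(\{X\theta' : \theta' \in B_S\}, \|\cdot\|_\Sigma)$. Thus, for every
$\theta' \in B_S$ there is a $\theta^l = \theta^l_S(\theta')$ such that $\|X(\theta' - \theta^l)\|_\Sigma \le \delta$. Note that $\{\theta^l_S\} \subset \mathbb{R}^p_S$.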

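Following an idea of Loh and Wainwright [2012] we then have

$$\sup_{\theta' \in B_S} |(\theta' - \theta_S(\theta'))^T\Phi(\theta' - \theta_S(\theta'))| \le \delta^2 \sup_{\vartheta \in B_S} \vartheta^T\Phi\vartheta,$$

and for any fixed $\theta$

$$\sup_{\theta' \in B_S} |(\theta' - \theta_S(\theta'))^T\Phi\theta| \le \delta \sup_{\vartheta \in B_S} |\vartheta^T\Phi\theta|.$$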



This implies that 



sup \ { erm < ^^1(0^1 i ^ yl^f^i 



With 5=1/2 this says that 



sup |(0') T 1>0'| < |max|(0^) T $^| + |max|(0^) T $^| 



Condition 2a ensures that for some constants $a$ and $\kappa$ depending only on $\sigma_0$
and $\kappa_0$, and for any $u$ and $v$ with $\|Xu\|_\Sigma \le 1$ and $\|Xv\|_\Sigma \le 1$, and any $t > 0$,
it holds that

$$P\Bigg(|u^T\Phi v| \ge a\sqrt{\frac{t}{n}} + \kappa\frac{t}{n}\Bigg) \le 2\exp[-t].$$

This follows from the fact that the $((Xu)_i)$ and $((Xv)_i)$ are sub-Gaussian,
hence the products $((Xu)_i(Xv)_i)$ are sub-exponential. Bernstein's inequality
can therefore be used (see Bennett [1962] and for the form presented above, e.g.
Bühlmann and van de Geer [2011], Lemma 14.9). Finally, the covering number
of a ball in $k$-dimensional space is well known. Apply for example Lemma 14.27
in Bühlmann and van de Geer [2011]: $N(\delta) \le ((2 + \delta)/\delta)^k$. If we take $\delta = 1/2$
this gives $N(1/2) \le 5^k$. The union bound then proves (28). □
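
The counting in this last step is simple enough to check numerically. The Python sketch below evaluates the covering bound $((2+\delta)/\delta)^k$ exactly with rational arithmetic, and also the logarithm of the number of pairs of covering points, $2k\log 5$, which is one plausible reading of where the term $2k\log 5$ in (28) comes from (an interpretation, not a statement from the source). The function names are hypothetical.

```python
import math
from fractions import Fraction

def covering_bound(delta: Fraction, k: int) -> Fraction:
    """Upper bound ((2 + delta) / delta)^k on the delta-covering number
    of the unit ball in k-dimensional space."""
    return ((2 + delta) / delta) ** k

def log_pair_count(k: int) -> float:
    """log of the number of pairs of points from a cover of size at most 5^k,
    i.e. log(5^k * 5^k) = 2 k log 5."""
    return 2 * k * math.log(5)
```

With `delta = Fraction(1, 2)` the base is exactly 5, so `covering_bound(Fraction(1, 2), k)` returns $5^k$.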



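3.3.1. A ratio-bound for $\theta' \mapsto \varepsilon^TX(\theta - \theta')$

Lemma 2. Suppose that $\varepsilon \sim N(0,I)$ is independent of $X$. Let $\delta > 0$. Then
for any $t \ge \max(1/\delta, 1)$, and for $R_t = tC_0\sqrt{k_0 \log p/n}$ where $C_0$ is a universal
constant, we have

$$\sup_\theta P_\theta\Bigg(\sup_{\theta' \in B_0(k_0),\ \|X(\theta-\theta')\|_n > R_t} \frac{|\varepsilon^TX(\theta - \theta')|/n}{\|X(\theta - \theta')\|_n^2} \ge \delta \,\Bigg|\, X\Bigg) \le C_1\exp\Bigg[-\frac{t^2\delta^2 k_0\log p}{C_2}\Bigg]$$

for some universal constants $C_1$ and $C_2$.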
Proof. Let

$$\mathcal{G}_R(\theta) := \{\theta' : \|X(\theta - \theta')\|_n \le R,\ \theta' \in B_0(k_0)\}.$$

Then, using the bound $\log\binom{p}{k_0} \le k_0\log p$ and, e.g., Lemma 14.27 in Bühlmann and van de Geer
[2011] we have

$$H(u, \{X(\theta - \theta') : \theta' \in \mathcal{G}_R(\theta)\}, \|\cdot\|_n) \le (k_0 + 1)\log\Big(\frac{2R + u}{u}\Big) + k_0\log p, \quad u > 0.$$

Indeed, if we fix the locations of the zeros, say $\theta' \in B'_0(k_0) := \{\vartheta : \vartheta_j = 0\ \forall j > k_0\}$,
the space $\{X\theta' : \theta' \in B'_0(k_0)\}$ is a $k_0$-dimensional linear space, so that

$$H(u, \{X\theta' : \theta' \in B'_0(k_0), \|X\theta'\|_n \le R\}, \|\cdot\|_n) \le k_0\log\Big(\frac{2R + u}{u}\Big), \quad u > 0.$$

Furthermore, the vector $X\theta$ is fixed, so that $\mathcal{G}_R(\theta)$ is a subset of a ball with
radius $R$ in the $(k_0 + 1)$-dimensional linear space spanned by $\{X_j\}_{j=1}^{k_0}$ and $X\theta$.

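By Dudley's bound (see Dudley [1967], or more recent references such as van der Vaart and Wellner
[1996], van de Geer [2000]), applied to the (conditional on $X$) Gaussian process
$\theta' \mapsto \varepsilon^TX(\theta - \theta')$, we obtain

$$E\Bigg[\sup_{\theta' \in \mathcal{G}_R(\theta)} |\varepsilon^TX(\theta - \theta')| \,\Bigg|\, X\Bigg] \le C'\int_0^R \sqrt{nH(u, \mathcal{G}_R(\theta), \|\cdot\|_n)}\,du \le C\sqrt{2k_0\log p}\,\sqrt{n}\,R,$$

for some universal constants $C \ge 1$ and $C'$. By the Borell-Sudakov-Cirelson
Gaussian concentration inequality (e.g., Massart [2003]), we therefore have for
all $u > 0$,

$$P\Bigg(\sup_{\theta' \in \mathcal{G}_R(\theta)} |\varepsilon^TX(\theta - \theta')|/n \ge CR\sqrt{\frac{2k_0\log p}{n}} + R\sqrt{\frac{2u}{n}} \,\Bigg|\, X\Bigg) \le \exp[-u].$$

Substituting $u = v^2 k_0\log p$ gives that for all $v > 0$

$$P\Bigg(\sup_{\theta' \in \mathcal{G}_R(\theta)} |\varepsilon^TX(\theta - \theta')|/n \ge (C + v)R\sqrt{\frac{2k_0\log p}{n}} \,\Bigg|\, X\Bigg) \le \exp[-v^2 k_0\log p]$$

which implies that for all $v \ge 1$,

$$P\Bigg(\sup_{\theta' \in \mathcal{G}_R(\theta)} |\varepsilon^TX(\theta - \theta')|/n \ge 2vCR\sqrt{\frac{2k_0\log p}{n}} \,\Bigg|\, X\Bigg) \le \exp[-v^2 k_0\log p].$$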
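Now insert the peeling device (see Alexander [1985], the terminology coming
from van de Geer [2000], Section 5.3). Let $R_t := 8Ct\sqrt{2k_0\log p/n}$. We then
have

$$\begin{aligned} &P\Bigg(\sup_{\theta' \in B_0(k_0),\ \|X(\theta-\theta')\|_n > R_t} \frac{|\varepsilon^TX(\theta - \theta')|/n}{\|X(\theta - \theta')\|_n^2} \ge \delta \,\Bigg|\, X\Bigg) \\ &\quad\le \sum_{s=1}^\infty P\Bigg(\sup_{\theta' \in \mathcal{G}_{2^sR_t}(\theta)} |\varepsilon^TX(\theta - \theta')|/n \ge \delta 2^{2(s-1)}R_t^2 \,\Bigg|\, X\Bigg) \\ &\quad= \sum_{s=1}^\infty P\Bigg(\sup_{\theta' \in \mathcal{G}_{2^sR_t}(\theta)} |\varepsilon^TX(\theta - \theta')|/n \ge 2^sR_t \times 2C(2^st\delta)\sqrt{\frac{2k_0\log p}{n}} \,\Bigg|\, X\Bigg) \\ &\quad\le \sum_{s=1}^\infty \exp[-2^{2s}t^2\delta^2k_0\log p] \le C_1\exp\Bigg[-\frac{t^2\delta^2k_0\log p}{C_2}\Bigg] \end{aligned}$$

for some universal constants $C_1$ and $C_2$, completing the proof.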
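
The bookkeeping in the peeling step can be verified numerically. The Python sketch below assumes the reading $R_t = 8Ct\sqrt{2k_0\log p/n}$ and checks that the level $\delta 2^{2(s-1)}R_t^2$ on the $s$-th shell coincides with $2^sR_t \cdot 2Cv\sqrt{2k_0\log p/n}$ for $v = 2^st\delta$, which is the form needed to apply the concentration bound with radius $R = 2^sR_t$. The function name `peeling_levels` and the numerical inputs are illustrative assumptions.

```python
import math

def peeling_levels(C: float, t: float, delta: float,
                   k0: int, n: int, p: int, s: int):
    """Return (lhs, rhs, v) for shell s of the peeling argument:
    lhs = delta * 2^(2(s-1)) * R_t^2,
    rhs = (2^s R_t) * 2 C v * sqrt(2 k0 log p / n) with v = 2^s t delta."""
    rate = math.sqrt(2 * k0 * math.log(p) / n)
    R_t = 8 * C * t * rate
    lhs = delta * 2 ** (2 * (s - 1)) * R_t ** 2
    v = 2 ** s * t * delta
    rhs = (2 ** s * R_t) * 2 * C * v * rate
    return lhs, rhs, v
```

The requirement $t \ge 1/\delta$ in the lemma guarantees $v \ge 1$ on every shell, which is what the last concentration inequality needs.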



3.3.2. A ratio-bound for $\theta' \mapsto \langle (E_\theta X^TX - X^TX)\theta, \theta - \theta'\rangle$

Lemma 3. We have, for every $\delta > 0$, $R_t = tD_1\sqrt{k_0\log p/n}$, $t \ge 1$, some
positive constants $D_1, D_2, D_3, D_4, D_5$ depending on $\delta$, that

$$\sup_\theta P_\theta\Bigg(\sup_{\theta' \in B_0(k_0):\ \|\theta - \theta'\| > R_t} \frac{|L_n^{(1)}(\theta')|}{\|\theta - \theta'\|^2} > \delta\Bigg) \le B(t,p,n)$$

where $B(t,p,n) = D_2e^{-D_3t\delta k_0\log p}$ under the assumptions of Theorem 6 and
$B(t,p,n) = D_4e^{-D_5t\delta\sqrt{n\log p}/k_1}$ under the assumptions of Theorem 7.

Proof. The process in question is of the form

$$L_n^{(1)}(\theta') := \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^p (Z_{ij} - EZ_{ij})(\theta_j - \theta'_j), \qquad Z_{ij} = \sum_{m=1}^p \theta_m X_{im}X_{ij}. \tag{29}$$

Since the $X_{ij}$ are uniformly bounded by $b$, we conclude that the summands in
$i$ of this process are uniformly bounded by

7 — 1 m= 1 
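and the weak variances equal, for $\delta_{mj}$ the Kronecker delta,

$$\begin{aligned} n\mathrm{Var}_\theta(L_n^{(1)}(\theta')) &= \sum_{j,l} E(Z_{ij} - EZ_{ij})(Z_{il} - EZ_{il})(\theta_j - \theta'_j)(\theta_l - \theta'_l) \\ &= \sum_{m,j,m',l} E(X_{im}X_{ij} - \delta_{mj})(X_{im'}X_{il} - \delta_{m'l})\theta_m\theta_{m'}(\theta_j - \theta'_j)(\theta_l - \theta'_l) \\ &= \sum_{m,j,m',l} D_{mjm'l}\theta_m\theta_{m'}(\theta_j - \theta'_j)(\theta_l - \theta'_l) \\ &\le c\|\theta\|_2^2\|\theta - \theta'\|_2^2 \end{aligned} \tag{31}$$

where we have used, by the design assumptions, that $D_{mjm'l} \le 1$ whenever
the indices $m, j, m', l$ match exactly to two distinct values, $D_{mjm'l} \le EX_{11}^4$ if
$m = l = j = m'$, and $D_{mjm'l} = 0$ in all other cases, as well as the Cauchy-Schwarz
inequality.

Therefore $L_n^{(1)}$ is a uniformly bounded empirical process $\{(P_n - P)(f_{\theta'})\}_{\theta' \in H_0}$
given by

$$\frac{1}{n}\sum_{i=1}^n (f_{\theta'}(Z_i) - Ef_{\theta'}(Z_i)), \qquad f_{\theta'}(Z_i) = \sum_{j=1}^p\sum_{m=1}^p \theta_m X_{im}X_{ij}(\theta_j - \theta'_j)$$

with variables $Z_i = (X_{i1}, \dots, X_{ip})^T \in \mathbb{R}^p$. Define

$$\mathcal{F}_s = \{f = f_{\theta'} : \theta' \in H_0,\ \|\theta' - \theta\|^2 \le 2^{s+1}\}.$$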



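We know $R_t \le \|\theta - \theta'\| \le \sqrt{C}$ so the first probability in (26) can be bounded,
for $c' > 0$ a small constant, by

$$\begin{aligned} &P_\theta\Bigg(\max_{s \in \mathbb{Z}:\ c'R_t^2 \le 2^s \le C}\ \sup_{\theta' \in H_0,\ 2^s \le \|\theta - \theta'\|^2 \le 2^{s+1}} \frac{|L_n^{(1)}(\theta')|}{\|\theta - \theta'\|^2} > \delta\Bigg) \\ &\quad\le \sum_{s \in \mathbb{Z}:\ c'R_t^2 \le 2^s \le C} P_\theta\Bigg(\sup_{\theta' \in H_0,\ \|\theta - \theta'\|^2 \le 2^{s+1}} |L_n^{(1)}(\theta')| > 2^s\delta\Bigg). \end{aligned}$$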

Moreover, $\mathcal{F}_s$ varies in a linear space of measurable functions of dimension $k_0$,
so we have, from $\log\binom{p}{k_0} \le k_0\log p$ and from Theorem 2.6.7 and Lemma 2.6.15
in van der Vaart and Wellner [1996] that

$$H(u, \mathcal{F}_s, L^2(Q)) \le k_0\log(AU/u) + k_0\log p, \quad 0 < u < UA,$$

for some fixed constant $A$ and envelope bound $U$ of $\mathcal{F}_s$. Using (30), if $\theta, \theta'$ are
bounded in $\ell^1$ by $M$ we can take $U$ a large enough fixed constant depending
on $M, b$ only, and if $k_0$ is constant we can take $U = \max(k_1\sqrt{2^s}, 1)$ since
$\|\theta - \theta'\|_1 \le \sqrt{k_1}\|\theta - \theta'\|_2$. The moment bound for empirical processes under a uniform
entropy condition (Theorem 3.1 in Giné and Koltchinskii [2006]) then gives,
using (31),

$$E\|P_n - P\|_{\mathcal{F}_s} \lesssim \sqrt{\frac{2^sk_0\log p}{n}} + \frac{Uk_0\log p}{n} \tag{32}$$

which is, under the maintained hypotheses, of smaller order than $2^s\delta$ precisely
for those $s$ such that $R_t^2 \simeq (k_0/n)\log p \le 2^s$. The last sum of probabilities can
thus be bounded, for $D_1$ large enough and $c_0$ some positive constant, by

$$\sum_{s \in \mathbb{Z}:\ c'R_t^2 \le 2^s \le C} P_\theta\big(n\|P_n - P\|_{\mathcal{F}_s} - nE\|P_n - P\|_{\mathcal{F}_s} > c_0n2^s\delta\big),$$

to which we can apply Talagrand's inequality (Talagrand [1996]) (as at the end
of the proof of Proposition 1 in Bull and Nickl [2012]), to obtain the bound

$$\sum_{s \in \mathbb{Z}:\ c'R_t^2 \le 2^s \le C} \exp\Bigg\{-\frac{c_0^2n^2(2^s\delta)^2}{n2^{s+1} + nUE\|P_n - P\|_{\mathcal{F}_s} + Uc_0n2^s\delta}\Bigg\}.$$

Using (32) this gives the desired bound $D_2e^{-D_3t\delta k_0\log p}$ when the envelope $U$
is constant, and the bound $B(t,p,n) = D_4e^{-D_5t\delta(n\log p)^{1/2}/k_1}$ when the envelope
is $U = \max(k_1\sqrt{2^s}, 1)$ (with $k_0$ constant), completing the proof. □



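3.3.3. Tail Inequalities for Sparse Estimators

Recall that $S_\vartheta := \{j : \vartheta_j \ne 0\}$. Let $k_\vartheta := |S_\vartheta|$. For $\lambda > 0$, take the estimator

$$\tilde\theta := \arg\min_\vartheta \big\{\|Y - X\vartheta\|_2^2/n + \lambda^2k_\vartheta\big\}. \tag{33}$$

Lemma 4. Let $\varepsilon \sim N(0,I)$ be independent of $X$. Take $\lambda^2 = C_3\log p/n$ where
$C_3$ is an appropriate universal constant. Let $t \ge 1$ be arbitrary and $R_t := \sqrt{t/n}$.
Then for some universal constants $C_4$ and $C_5$,

$$\sup_{\theta \in B_0(k_0)} P_\theta\Big(\|X(\tilde\theta - \theta)\|_n^2 + \lambda^2k_{\tilde\theta} > 2\lambda^2k_0 + R_t^2 \,\Big|\, X\Big) \le C_4\exp\Bigg[-\frac{nR_t^2}{C_5}\Bigg].$$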

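Proof. The result follows from an oracle inequality for least squares estimators
with general penalties as given in van de Geer [2001]. For completeness, we
present a full proof. Define

$$\tau^2(\vartheta; \theta) := \|X(\vartheta - \theta)\|_n^2 + \lambda^2k_\vartheta.$$

Let

$$\mathcal{G}_R(\theta) := \{\vartheta : \tau^2(\vartheta; \theta) \le R^2\}.$$

If $\tau^2(\tilde\theta; \theta) \le 2\lambda^2k_\theta$ we are done. So suppose $\tau^2(\tilde\theta; \theta) > 2\lambda^2k_\theta$. Since $\tilde\theta$ minimises the
penalised criterion in (33), comparing its value at $\tilde\theta$ and at $\theta$ gives
$\tau^2(\tilde\theta; \theta) \le (2/n)\varepsilon^TX(\tilde\theta - \theta) + \lambda^2k_\theta$. We then have

$$(2/n)\varepsilon^TX(\tilde\theta - \theta) \ge \tau^2(\tilde\theta; \theta) - \lambda^2k_\theta \ge \tau^2(\tilde\theta; \theta)/2.$$

Now again apply the peeling device: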



P 



e T X(d-9)/n 1 



x 



oo 



sup 



s=l 



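But if $\vartheta \in \mathcal{G}_R(\theta)$, we know that $\|X(\vartheta - \theta)\|_n \le R$ and that $k_\vartheta \le R^2/\lambda^2$. Hence,
as in the proof of Lemma 2, we know that

$$P\Bigg(\sup_{\vartheta \in \mathcal{G}_R(\theta)} \varepsilon^TX(\vartheta - \theta)/n \ge 2CR\sqrt{\frac{2R^2\log p}{n\lambda^2}} \,\Bigg|\, X\Bigg) \le \exp\Bigg[-\frac{C^2R^2\log p}{\lambda^2}\Bigg].$$

As $\lambda = 32C\sqrt{2\log p/n}$, we get

$$P\Bigg(\sup_{\vartheta \in \mathcal{G}_R(\theta)} \varepsilon^TX(\vartheta - \theta)/n \ge \frac{R^2}{16} \,\Bigg|\, X\Bigg) \le \exp\Bigg[-\frac{nR^2}{2 \times (32)^2}\Bigg].$$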
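
The effect of the choice $\lambda = 32C\sqrt{2\log p/n}$ can be checked by direct substitution. The Python sketch below assumes the sparsity bound $k = R^2/\lambda^2$ and the concentration bound of Lemma 2 applied with $v = C$, and confirms that the threshold becomes $R^2/16$ and the exponent $nR^2/(2 \times 32^2)$. The function name `lemma4_levels` and the numerical inputs are hypothetical.

```python
import math

def lemma4_levels(C: float, R: float, n: int, p: int):
    """With lambda = 32 C sqrt(2 log p / n) and k = R^2 / lambda^2, return
    (threshold, exponent) where
    threshold = 2 C R sqrt(2 k log p / n) and exponent = C^2 k log p."""
    lam = 32 * C * math.sqrt(2 * math.log(p) / n)
    k = R ** 2 / lam ** 2
    threshold = 2 * C * R * math.sqrt(2 * k * math.log(p) / n)
    exponent = C ** 2 * k * math.log(p)
    return threshold, exponent
```

Both quantities no longer depend on $C$ or $p$ after the substitution, which is why the final bound is stated with universal constants.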



We therefore have 



P 



sup 



e T X(d-8)/n 1 
r 2 (d,9) - 4 



< 



^exp 



n2 2s R 2 



< Ca exp 



nR 2 



□ 



2 x (32) 2 

for some universal constants $C_4$ and $C_5$.

Corollary 2. Assume Condition 2 and let $\varepsilon \sim N(0,I)$ be independent of $X$.
Let $\tilde\theta$ be as in (33) with $\lambda^2 = (C_3\log p)/n$ where $C_3$ is as in Lemma 4, and let
$k_0 = o(n/\log p)$. Then for some universal constants $C_6, C_7, C_8, c$, every $C \ge C_6$
and every $n$ large enough



sup P e ( ||0 

eeB (fco) 



> C 



k \ogp 



< CVexp 



k \ogp 



Proof. By Lemma 4 with $R_t$, $t$ equal to a large constant times $k_0\log p$, we see
first $k_{\tilde\theta} \le 3k_0$ on the event on which the exponential inequality holds. Then
from Corollary 1, on an event of sufficiently large probability,

$$\|\tilde\theta - \theta\|^2 \le C(\Lambda_{\min})\|X(\tilde\theta - \theta)\|_n^2$$

for $n$ large enough, so that the result follows from applying Lemma 4 again (this
time to $\|X(\tilde\theta - \theta)\|_n^2$), and from combining the bounds. □